GROUP WORK - DEADLINE 15-Dec-23.
Please submit your final report using this form.

Radon, a natural radioactive gas, is a known carcinogen that can cause lung cancer, especially in high concentrations. In the United States, it’s estimated to lead to thousands of lung cancer deaths annually. Radon levels in homes across the country vary significantly, with some houses having dangerously high concentrations.

To pinpoint areas with high radon exposure, the Environmental Protection Agency conducted radon measurements in over 80,000 randomly selected houses nationwide. Our objective in analyzing this data was to estimate the range of radon levels in each of the approximately 3000 U.S. counties. This information would help homeowners decide whether to measure or reduce radon in their houses based on local conditions.

For the analysis, we organized the data hierarchically: houses within counties. If we were to consider multiple measurements within houses, it would create a three-level hierarchy of measurements, houses, and counties.

Creating a reproducible lab report

This time, you will need to use your own RStudio environment. You can also use one of the spaces you used for a previous lab, if you like!

  • Load the tidyverse, table1, GGally, broom, broom.mixed, lmerTest, faux, marginaleffects, and modelsummary packages, and any other packages you might deem useful.
  • Load the radon.rds data set into your workspace, and assign it to an object called radon.
radon <- read_rds("https://bit.ly/BMLR-Radon")
  • Knit your file to see that everything is working.
  • For each question, please add the text of the question, the code chunks and your answer.
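The setup chunk for your report might look something like the following (a sketch, assuming all of the packages listed above are already installed):

```r
# Load the packages used throughout this lab
library(tidyverse)       # data wrangling and plotting
library(table1)          # descriptive tables
library(GGally)          # ggpairs()
library(broom)           # tidy() for lm models
library(broom.mixed)     # tidy() for lmer models
library(lmerTest)        # lmer() with p-values
library(faux)            # simulating clustered data
library(marginaleffects) # model-based predictions
library(modelsummary)    # regression tables

# Read the radon data from the URL given above
radon <- read_rds("https://bit.ly/BMLR-Radon")
```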

Simulating the data

Simulating data is an important step when analyzing data, as it serves a range of purposes. First, it helps clarify our assumptions and expectations: the treatment effects we might realistically observe, the types of interactions we should consider, and the size and features of the data we need to establish the claims we want to make. We can work in a “clean” environment where variables are truly normal, there are no missing values, and the data is “well behaved”. Once we have grappled with this ideal world, we can slowly add complexities and see how they change the picture.

Second, simulating “fake” data is a general method for understanding how statistical methods behave under repeated sampling. By automating the sampling process, we can evaluate a model’s performance: we define the parameters of our “fake” population, compare the model’s estimates and predictions with those true values, and include features that challenge our assumptions and improve our understanding. We simulate in order to test and validate our models before applying them to real data.

Third, simulating fake data helps find issues in our code. If your model struggles to recover true parameters with large samples or low data variance, it could signal a coding or conceptual problem. Overlaying simulated data with assumed and fitted models can aid in identifying these issues.

Finally, simulating synthetic data is crucial when planning new studies or collecting data. It provides a basis for what to expect, ensuring a better understanding of potential outcomes.

In the first part of this lab we are going to replicate some of the ideas presented in a post written by Michael Clark. You may consider reading the post before continuing.

We are going to use the faux package to simulate nested data (observations nested within clusters). First, we define the number of clusters n_cluster and the number of observations in each cluster, obs_per_cluster. For example, we might want to have 20 clusters with 4 observations in each cluster (to keep things simple, let’s stick to a “balanced” data-set).

obs_per_cluster <-  4
n_cluster <-  20

Now, let’s generate some data, where (level 1) observed outcomes are related to predictors in the following manner:

\[ Y = a_i + b_i\cdot X + \epsilon \\ \text{where:}~~~ \epsilon \sim \mathcal{N}(0, \sigma^2) \]

But rather than being regular population parameters, \(a_i\) and \(b_i\) are random intercepts and slopes, such that each cluster has its own, unique intercept and slope. The random intercept is a level 2 random effect:

\[ a_i = \alpha_0 + u_i \\ \text{where:}~~~u_i \sim \mathcal{N}(0, \sigma_u^2)\\ \] And the random slope is a level 2 random effect:

\[ b_i = \beta_0 + v_i \\ \text{where:}~~~v_i \sim \mathcal{N}(0, \sigma_v^2)\\ \] To keep things simple, we will make sure that the random intercepts and slopes are independent, that is, the correlation between \(u_i\) and \(v_i\) is zero: \(\rho_{u,v}=0\).
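Substituting the two level-2 equations into the level-1 model gives the combined (reduced-form) model, which makes the fixed and random parts explicit:

\[ Y = (\alpha_0 + u_i) + (\beta_0 + v_i)\cdot X + \epsilon = \underbrace{\alpha_0 + \beta_0 \cdot X}_{\text{fixed part}} + \underbrace{u_i + v_i \cdot X + \epsilon}_{\text{random part}} \]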

As you can see, at this stage we have only one predictor at level 1, \(X\), and no predictors at level 2. We also have to define six population parameters: the fixed effects \(\alpha_0 = 1\) and \(\beta_0 = 0.5\), three standard deviations \(\sigma = 1\), \(\sigma_u = 0.5\), \(\sigma_v = 0.25\), and the correlation \(\rho_{u,v} = 0\).

Feel free to change these values if you would like to simulate different data.

# Fixed intercept
alpha_0 <-  1  

# Fixed slope
beta_0 <- .5  

# Residuals st.dev (level 1) 
sig  <-  1

# Random intercept st.dev (level 2) 
sig_u <-  .5

# Random slope st.dev (level 2)
sig_v <-  .25

# Independent random effects
cor <-  0

Finally, here is the code that simulates our data in several steps. Run each step separately, and inspect the data frame with View(df) to examine its contents.

# Create a dataset that can be reproduced
set.seed(888)  
# step 1: create the units 
df <- add_random(
    cluster = n_cluster,  # first the clusters
    obs = obs_per_cluster # second, the observations
    ) 

# Print the result 
# try to figure out what just happened
df |> 
  print(n = 15)

# step 2: Add anything at level 2
# Here we have random intercept/slope 
df <- df |>
  add_ranef(
    "cluster",
    u_i = sig_u,
    v_i = sig_v,
    .cors = cor
    ) |> mutate(
      a_i = alpha_0 + u_i, 
      b_i = beta_0  + v_i
    )

# Print the result 
# try to figure out what just happened
df |> 
  print(n = 15)


# step 3: add anything at level 1
df <- df |>
  mutate(
    x       = rnorm(n_cluster * obs_per_cluster),
    epsilon = rnorm(n_cluster * obs_per_cluster, 0, sig),
    y = a_i + b_i * x + epsilon
    )


# Print the result 
# try to figure out what just happened
df |> 
  print(n = 20)

# For the models, we really only need 
# the cluster id, the y and the x

df  <- df |> select(cluster, x, y)

# Print the result 
# try to figure out what just happened
df |> 
  print(n = 20)

So now we have our final observations: the cluster, the x, and the y. Before continuing, please go back and set obs_per_cluster and n_cluster to larger numbers, to make sure our models work. For example, you might want to create a data set with 100 clusters and 10 observations per cluster. You may also want to experiment with other sample or cluster sizes, or change the six population parameters. Feel free to experiment.
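One convenient way to re-run the simulation with different settings is to wrap the three steps above in a single function. A minimal sketch (the function name and its defaults are ours, not part of the lab):

```r
library(faux)
library(dplyr)

# Simulate one balanced clustered data set;
# the defaults match the parameters defined above
simulate_clusters <- function(n_cluster = 20, obs_per_cluster = 4,
                              alpha_0 = 1, beta_0 = 0.5,
                              sig = 1, sig_u = 0.5, sig_v = 0.25,
                              cor = 0) {
  add_random(cluster = n_cluster, obs = obs_per_cluster) |>   # step 1: units
    add_ranef("cluster", u_i = sig_u, v_i = sig_v,            # step 2: level 2
              .cors = cor) |>
    mutate(
      a_i = alpha_0 + u_i,
      b_i = beta_0  + v_i,
      x       = rnorm(n_cluster * obs_per_cluster),           # step 3: level 1
      epsilon = rnorm(n_cluster * obs_per_cluster, 0, sig),
      y = a_i + b_i * x + epsilon
    ) |>
    select(cluster, x, y)
}

# A larger data set: 100 clusters, 10 observations each
set.seed(888)
df <- simulate_clusters(n_cluster = 100, obs_per_cluster = 10)
```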

We will then test the data against three types of models:

  • Complete pooling: Here we ignore the structure of the data, and pretend that all the observations are uncorrelated and independent of one another. The model disregards the clusters and is oblivious to the observations within clusters being more similar to one another than observations between clusters. We will be estimating just two fixed population parameters: the one intercept and one slope.
  • No pooling: Here we estimate a separate, fixed intercept for each cluster by adding a dummy variable for every cluster except the reference cluster. For example, if we had 100 clusters we would be estimating one intercept and 99 dummy variables (fixed effects), which is essentially like having 100 different intercepts, one for each cluster. Each cluster is estimated from its own observations alone, with no information shared across clusters.
  • Partial pooling: Here we fit a multilevel model with random intercepts and random slopes. Instead of a separate fixed effect for every cluster, we estimate the fixed intercept and slope together with the variances of the cluster-level deviations, so each cluster’s estimate is shrunk toward the overall mean: a compromise between complete pooling and no pooling.
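With the simulated data frame df in hand, the three models could be fit along these lines (a sketch; add an x:cluster interaction to the no-pooling model if you also want a separate slope per cluster):

```r
library(lmerTest)
library(modelsummary)

# Complete pooling: one intercept and one slope for everyone
m_pool <- lm(y ~ x, data = df)

# No pooling: a separate (fixed) intercept for every cluster
m_nopool <- lm(y ~ x + cluster, data = df)

# Partial pooling: multilevel model with random
# intercepts and random slopes by cluster
m_ml <- lmer(y ~ x + (1 + x | cluster), data = df)

# Side-by-side comparison table
modelsummary(
  list("Complete pooling" = m_pool,
       "No pooling"       = m_nopool,
       "Multilevel model" = m_ml),
  stars = TRUE
)
```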
  1. Adjust the parameters, simulate your clustered data, and present the results of the three models using the modelsummary package. Interpret the three models and compare the estimates with each of the six parameters you used to create your dataset. Is there one single model that is superior to all the rest in terms of estimating the parameters? If so, which is it? If not, which model works best for what purpose? In your answer, please pay attention to the estimated Intraclass Correlation Coefficient (ICC) of your multilevel model. What does the ICC mean? How much does it deviate from what you would expect?
                    Complete pooling    No pooling          Multilevel model
(Intercept)         1.017 (0.042)***    2.184 (0.233)***    1.009 (0.098)***
x                   0.562 (0.044)***    0.527 (0.040)***    0.544 (0.050)***
clustercluster21                        −1.056 (0.329)**
clustercluster22                        −1.290 (0.329)***
clustercluster23                        −0.715 (0.329)*
clustercluster24                        −1.510 (0.329)***
Num.Obs.            800                 800                 800
R2                  0.170               0.396
R2 Adj.             0.169               0.364
R2 Cond.                                                    0.376
AIC                 2551.1              2374.9              2423.9
BIC                 2565.1              2571.7              2452.0
ICC                                                         0.3
RMSE                1.19                1.01                0.99
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
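As a sanity check when answering the ICC question, you can compute the ICC implied by your own simulation parameters directly. With the default values above (\(\sigma_u = 0.5\), \(\sigma = 1\)), the intercept-only formula gives 0.2; note that this simple formula ignores the random slope, and your fitted ICC will differ if you adjusted the parameters:

```r
# ICC implied by the simulation parameters:
# share of (intercept + residual) variance due to clusters
sig   <- 1     # level-1 (residual) sd
sig_u <- 0.5   # random-intercept sd (level 2)

icc <- sig_u^2 / (sig_u^2 + sig^2)
icc
#> [1] 0.2
```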
  1. Use your models to create a caterpillar plot for the intercepts and lattice or spaghetti plots to show the features of your multilevel models. Describe and discuss the plots: what new light do they shed? What are the advantages and disadvantages of presenting the models using these plots?

Hints for creating the plots will be available in the following few days. Thank you for your patience.
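In the meantime, here is one possible starting point for the caterpillar plot of the random intercepts, using broom.mixed (a sketch that assumes your multilevel model is stored in an object called m_ml):

```r
library(broom.mixed)
library(tidyverse)

# Extract the estimated random intercepts with confidence intervals
re <- tidy(m_ml, effects = "ran_vals", conf.int = TRUE) |>
  filter(term == "(Intercept)")

# Caterpillar plot: clusters ordered by their estimated intercept
re |>
  mutate(level = fct_reorder(level, estimate)) |>
  ggplot(aes(x = estimate, y = level)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Random intercept (deviation from the fixed intercept)",
       y = "Cluster")
```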

Radon in the home: a case study

Our aim in examining this data is to estimate how radon levels are distributed across the approximately 3000 counties in the United States. This information can empower homeowners to decide whether to test and address radon in their homes, using the best local knowledge available.

To conduct this analysis, we view the data hierarchically: houses (level one) within counties (level 2). If we were to delve deeper into multiple measurements within houses, we could create a three-tier hierarchy: measurements (level 1), houses (level 2), and counties (level 3). During the analysis, we consider an important factor: where the measurement was taken, whether in the basement or on the first floor. Radon, originating underground, can more readily enter homes built into the ground.

Additionally, at the county level, we used a measurement of soil uranium available for each county. This hierarchical model enables us to create a regression model for individual measurements while accommodating consistent, unexplained variation among the roughly 3000 counties.

This section closely follows chapter 12 in Gelman and Hill’s influential book Data analysis using regression and multilevel/hierarchical models. You will find the chapter available as an optional reading (link also available through Brightspace). The chapter is available on Perusall to allow you to share code, discuss, or ask questions. I will be looking at the discussion from time to time, so if you need help, please do not hesitate to #tag me 🤙

In that chapter you will find the background for the study, and an explanation of the variables. The original dataset and the code used to produce the book is available here.

The dataset consists of a large number of radon measurements, so in this document we will include only the measurements from the state of Minnesota. Feel free to choose a different state if you like.

radon_mn <- read_rds("radon.rds") |> 
  filter(state == "MN") |>
  # drop counties without measurements
  mutate(county = fct_drop(county))
  1. Create a table1 and use GGally::ggpairs to explore your data.
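A possible starting point for this question (a sketch; the variable names log_radon and floor are assumptions based on the Gelman & Hill radon materials, so check names(radon_mn) and adjust):

```r
library(table1)
library(GGally)
library(dplyr)

# Descriptive table; replace the variables with ones that
# actually appear in your data (check names(radon_mn))
table1(~ log_radon + floor, data = radon_mn)

# Pairwise plot of the numeric variables
radon_mn |>
  select(where(is.numeric)) |>
  ggpairs()
```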

You will now visualize the estimates and confidence intervals for the average log radon level across different counties, plotted against the (jittered) number of measurements taken in each county. We will use the estimates to compare two models: the no-pooling model and the multilevel model with partial pooling, both without predictors. The horizontal line in each plot represents an estimate of the average radon level across all counties. This is a replication of figure 12.1 in the chapter mentioned above.

  1. Replicate the figures below and explain why the counties with fewer measurements have more variable estimates and larger standard errors. What are the main advantage and disadvantage of the no-pooling analysis, as illustrated in the plot on the left, compared with the multilevel model? Explain.

Hints for creating the plots will be available in the following few days. Thank you for your patience.

  1. In this exercise, you will replicate a set of models as shown below. This illustration is based on the ideas shown in figure 12.4 in the book. Please replicate this illustration with a twist: for example, you may choose a different set of counties than the one shown below or the one shown in the book, or you may try something different. Please explain the graphs. What do the dashed red and black lines show? What is the blue line? What determines whether the blue line will be more similar to the red line or the black line? How does this figure compare to the one shown in chapter 12 of the book?

Hints: the three lines show the predictions of the three models: the complete pooling, the no pooling, and the multilevel (partial pooling) models. Further hints for creating the plots will be available in the following few days. Thank you for your patience.

  1. Finally, we are going to add a fixed effect at the county level (level 2). This illustration is based on figure 12.6 in the book, and it shows the estimated county-level coefficients, plotted vs. county-level uranium levels, along with the estimated multilevel regression line at level 2. After replicating this graph, please explain it briefly.

Hints for creating the plots will be available in the following few days. Thank you for your patience.